Abstract. The CSA-ES is an Evolution Strategy with Cumulative Step size Adaptation,
where the step size is adapted measuring the length of a so-called cumulative
path. The cumulative path is a combination of the previous steps realized
by the algorithm, where the importance of each step decreases with time. This 
article studies the CSA-ES on composites of strictly increasing functions with 
affine linear functions through the investigation of its underlying Markov chains. 
Rigorous results on the change and the variation of the step size are derived with 
and without cumulation. The step-size diverges geometrically fast in most cases. 
Furthermore, the influence of the cumulation parameter is studied. 

Keywords: CSA, cumulative path, evolution path, evolution strategies, step-size adap- 
tation 

1 Introduction 

Evolution strategies (ESs) are continuous stochastic optimization algorithms searching
for the minimum of a real valued function $f : \mathbb{R}^n \to \mathbb{R}$. In the
$(1,\lambda)$-ES, in each iteration, $\lambda$ new children are generated from a single
parent point $X \in \mathbb{R}^n$ by adding a random Gaussian vector to the parent,

\[ X \in \mathbb{R}^n \mapsto X + \sigma \mathcal{N}(0, C) . \]

Here, $\sigma \in \mathbb{R}_+^*$ is called step-size and $C$ is a covariance matrix. The
best of the $\lambda$ children, i.e. the one with the lowest $f$-value, becomes the parent
of the next iteration.
To achieve reasonably fast convergence, step size and covariance matrix have to be 
adapted throughout the iterations of the algorithm. In this paper, C is the identity and 
we investigate the so-called Cumulative Step-size Adaptation (CSA), which is used to 
adapt the step-size in the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) 
[ , ]. In CSA, a cumulative path is introduced, which is a combination of all steps the 
algorithm has made, where the importance of a step decreases exponentially with time. 
Arnold and Beyer studied the behavior of CSA on sphere, cigar and ridge functions 
[1,2,3, ] and on dynamical optimization problems where the optimum moves randomly 
[5] or linearly [ ]. Arnold also studied the behaviour of a $(1,\lambda)$-ES on linear functions
with linear constraint [ ].

In this paper, we study the behaviour of the $(1,\lambda)$-CSA-ES on composites of strictly
increasing functions with affine linear functions, e.g. $f : x \mapsto \exp(x_2 - 2)$. Because
the CSA-ES is invariant under translation, under change of an orthonormal basis (rotation
and reflection), and under strictly increasing transformations of the $f$-value, we
investigate, w.l.o.g., $f : x \mapsto x_1$. Linear functions model the situation when the current
parent is far (here infinitely far) from the optimum of a smooth function. To be far from
the optimum means that the distance to the optimum is large, relative to the step-size $\sigma$.
This situation is undesirable and threatens premature convergence. The situation should
be handled well, by increasing step widths, by any search algorithm (and is not handled
well by the $(1,2)$-$\sigma$SA-ES [9]). Solving linear functions is also very useful to prove
convergence independently of the initial state on more general function classes.

In Section 2 we introduce the $(1,\lambda)$-CSA-ES, and some of its characteristics on
linear functions. In Sections 3 and 4 we study $\ln(\sigma_t)$ without and with cumulation,
respectively. Section 5 presents an analysis of the variance of the logarithm of the
step-size and in Section 6 we summarize our results.

Notations In this paper, we denote $t$ the iteration or time index, $n$ the search space
dimension, $\mathcal{N}(0,1)$ a standard normal distribution, i.e. a normal distribution with mean
zero and standard deviation 1. The multivariate normal distribution with mean vector
zero and covariance matrix identity will be denoted $\mathcal{N}(0, I_n)$, and the $i^{\mathrm{th}}$
order statistic of $\lambda$ standard normal distributions, as well as its distribution, will be
denoted $\mathcal{N}_{i:\lambda}$. If $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ is a vector, then $[x]_i$
will be its value on the $i^{\mathrm{th}}$ dimension, that is $[x]_i = x_i$. A random
variable $X$ distributed according to a law $\mathcal{L}$ will be denoted $X \sim \mathcal{L}$.

2 The $(1,\lambda)$-CSA-ES

We denote with $X_t$ the parent at the $t^{\mathrm{th}}$ iteration. From the parent point $X_t$,
$\lambda$ children are generated: $Y_{t,i} = X_t + \sigma_t \xi_t^i$, with $i \in [[1,\lambda]]$, and
$\xi_t^i \sim \mathcal{N}(0, I_n)$, $(\xi_t^i)_{i \in [[1,\lambda]]}$ i.i.d. Due to the $(1,\lambda)$
selection scheme, from these children, the one minimizing the function $f$ is selected:
$X_{t+1} = \operatorname{argmin}\{f(Y),\ Y \in \{Y_{t,1}, \dots, Y_{t,\lambda}\}\}$. This latter
equation implicitly defines the random variable $\xi_t^\star$ as

\[ X_{t+1} = X_t + \sigma_t \xi_t^\star . \tag{1} \]
In order to adapt the step-size, the cumulative path is defined as

\[ p_{t+1} = (1-c)\, p_t + \sqrt{c(2-c)}\, \xi_t^\star \tag{2} \]

with $0 < c \le 1$. The constant $1/c$ represents the life span of the information contained
in $p_t$, as after $1/c$ generations $p_t$ is multiplied by a factor that approaches
$1/e \approx 0.37$ for $c \to 0$ from below (indeed $(1-c)^{1/c} \le \exp(-1)$). The typical
value for $c$ is between $1/\sqrt{n}$ and $1/n$. We will consider that
$p_0 \sim \mathcal{N}(0, I_n)$ as it makes the algorithm easier to analyze.

The normalization constant $\sqrt{c(2-c)}$ in front of $\xi_t^\star$ in Eq. (2) is chosen so that
under random selection and if $p_t$ is distributed according to $\mathcal{N}(0, I_n)$ then also
$p_{t+1}$ follows $\mathcal{N}(0, I_n)$. Hence the length of the path can be compared to the
expected length of $\|\mathcal{N}(0, I_n)\|$ representing the expected length under random selection.
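As a quick check of this normalization (a worked step, assuming that under random selection the step is $\mathcal{N}(0, I_n)$ and independent of $p_t$), each coordinate $i$ of the new path is a sum of two independent centered Gaussian terms, and its variance is:

```latex
\operatorname{Var}\left([p_{t+1}]_i\right)
  = (1-c)^2 \cdot 1 + c(2-c) \cdot 1
  = 1 - 2c + c^2 + 2c - c^2
  = 1 .
```

So the factor $\sqrt{c(2-c)}$ is exactly what keeps the unit variance of each coordinate.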



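The step-size update rule increases the step-size if the length of the path is larger
than the length under random selection and decreases it if the length is shorter than
under random selection:

\[ \sigma_{t+1} = \sigma_t \exp\left( \frac{c}{d_\sigma} \left( \frac{\|p_{t+1}\|}{E(\|\mathcal{N}(0, I_n)\|)} - 1 \right) \right) , \]

where the damping parameter $d_\sigma$ determines how much the step-size can change and
is set to $d_\sigma = 1$. A simplification of the update considers the squared length of the
path [5]:

\[ \sigma_{t+1} = \sigma_t \exp\left( \frac{c}{2 d_\sigma} \left( \frac{\|p_{t+1}\|^2}{n} - 1 \right) \right) . \tag{3} \]

This rule is easier to analyse and we will use it throughout the paper.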
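The algorithm described in this section can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes the parent update $X_{t+1} = X_t + \sigma_t \xi_t^\star$, the path update $p_{t+1} = (1-c) p_t + \sqrt{c(2-c)}\, \xi_t^\star$ and the squared-length rule $\sigma_{t+1} = \sigma_t \exp\big(\tfrac{c}{2 d_\sigma}(\|p_{t+1}\|^2/n - 1)\big)$. The function name, the default parameters and the choice $c = 1/\sqrt{n}$ are hypothetical.

```python
import math
import random


def f(x):
    """Linear test function f(x) = x_1 (first coordinate)."""
    return x[0]


def csa_es_log_step_size(n=5, lam=8, c=None, d_sigma=1.0, iterations=200, seed=1):
    """Return ln(sigma_t / sigma_0) after `iterations` steps of a (1, lam)-CSA-ES.

    Hypothetical helper for illustration; sigma_0 = 1, so ln(sigma_t) is returned.
    Very long runs would overflow floats because sigma grows geometrically.
    """
    rng = random.Random(seed)
    if c is None:
        c = 1.0 / math.sqrt(n)  # assumed default, inside the typical range
    x = [0.0] * n
    sigma = 1.0
    p = [rng.gauss(0.0, 1.0) for _ in range(n)]  # p_0 ~ N(0, I_n)
    for _ in range(iterations):
        # Sample lam candidate steps and keep the one whose child minimizes f.
        steps = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(lam)]
        best = min(steps, key=lambda s: f([xi + sigma * si for xi, si in zip(x, s)]))
        # Parent update: move along the selected step.
        x = [xi + sigma * si for xi, si in zip(x, best)]
        # Cumulative path update.
        p = [(1.0 - c) * pi + math.sqrt(c * (2.0 - c)) * si for pi, si in zip(p, best)]
        # Step-size update based on the squared path length.
        sq_norm = sum(pi * pi for pi in p)
        sigma *= math.exp(c / (2.0 * d_sigma) * (sq_norm / n - 1.0))
    return math.log(sigma)


if __name__ == "__main__":
    print(csa_es_log_step_size())
```

With a fixed seed the run is reproducible; on the linear function the returned log step-size is expected to grow roughly linearly with the number of iterations.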

Preliminary results on linear functions. Selection on the linear function, $f(x) = [x]_1$,
is determined by $[X_t]_1 + \sigma_t [\xi_t^\star]_1 \le [X_t]_1 + \sigma_t [\xi_t^i]_1$ for all $i$,
which is equivalent to $[\xi_t^\star]_1 \le [\xi_t^i]_1$ for all $i$, where by definition
$[\xi_t^i]_1$ is distributed according to $\mathcal{N}(0,1)$.
Therefore the first coordinate of the selected step is distributed according to
$\mathcal{N}_{1:\lambda}$ and all other coordinates are distributed according to $\mathcal{N}(0,1)$,
i.e. selection does not bias the distribution along the coordinates $2, \dots, n$. Overall we
have the following result.

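Lemma 1. On the linear function $f(x) = x_1$, the selected steps $(\xi_t^\star)_{t \in \mathbb{N}}$
of the $(1,\lambda)$-ES are i.i.d. and distributed according to the vector
$\xi := (\mathcal{N}_{1:\lambda}, \mathcal{N}_2, \dots, \mathcal{N}_n)$ where
$\mathcal{N}_i \sim \mathcal{N}(0,1)$ for $i \ge 2$.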

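Because the selected steps $\xi_t^\star$ are i.i.d. the path defined in Eq. (2) is an autonomous
Markov chain, that we will denote $\mathcal{P} = (p_t)_{t \in \mathbb{N}}$. Note that if the
distribution of the selected step depended on $(X_t, \sigma_t)$, as is generally the case on
non-linear functions, then the path alone would not be a Markov chain; however,
$(X_t, \sigma_t, p_t)$ would be an autonomous Markov chain. In order to study whether the
$(1,\lambda)$-CSA-ES diverges geometrically, we investigate the log of the step-size change,
whose formula can be immediately deduced from Eq. (3):

\[ \ln\left(\frac{\sigma_{t+1}}{\sigma_t}\right) = \frac{c}{2 d_\sigma} \left( \frac{\|p_{t+1}\|^2}{n} - 1 \right) . \tag{4} \]

By summing up this equation from $0$ to $t-1$ we obtain

\[ \frac{1}{t} \ln\left(\frac{\sigma_t}{\sigma_0}\right) = \frac{c}{2 d_\sigma} \left( \frac{1}{t} \sum_{k=1}^{t} \frac{\|p_k\|^2}{n} - 1 \right) . \tag{5} \]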

We are interested to know whether $\frac{1}{t} \ln(\sigma_t/\sigma_0)$ converges to a constant.
In case this constant is positive this will prove that the $(1,\lambda)$-CSA-ES diverges
geometrically. We recognize thanks to (5) that this quantity is equal to the sum of $t$ terms
divided by $t$, which suggests the use of the law of large numbers to prove convergence of (5).
We will start by investigating the case without cumulation $c = 1$ (Section 3) and then the
case with cumulation (Section 4).



3 Divergence rate of $(1,\lambda)$-CSA-ES without cumulation



In this section we study the $(1,\lambda)$-CSA-ES without cumulation, i.e. $c = 1$. In this case,
the path always equals the selected step, i.e. for all $t$, we have $p_{t+1} = \xi_t^\star$. We have
proven in Lemma 1 that the $\xi_t^\star$ are i.i.d. according to $\xi$. This allows us to use the
standard law of large numbers to find the limit of $\frac{1}{t} \ln(\sigma_t/\sigma_0)$ as well as
compute the expected log-step-size change.

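Proposition 1. Let $\Delta_\sigma := \frac{1}{2 d_\sigma n} \left( E(\mathcal{N}_{1:\lambda}^2) - 1 \right)$.
On linear functions, the $(1,\lambda)$-CSA-ES without cumulation satisfies (i) almost surely
$\lim_{t \to \infty} \frac{1}{t} \ln(\sigma_t/\sigma_0) = \Delta_\sigma$, and (ii)
for all $t \in \mathbb{N}$, $E(\ln(\sigma_{t+1}/\sigma_t)) = \Delta_\sigma$.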

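Proof. We have identified in Lemma 1 that the first coordinate of $\xi_t^\star$ is distributed
according to $\mathcal{N}_{1:\lambda}$ and the other coordinates according to $\mathcal{N}(0,1)$, hence
$E(\|\xi_t^\star\|^2) = E([\xi_t^\star]_1^2) + \sum_{i=2}^{n} E([\xi_t^\star]_i^2) = E(\mathcal{N}_{1:\lambda}^2) + n - 1$.
Therefore $E(\|\xi_t^\star\|^2)/n - 1 = (E(\mathcal{N}_{1:\lambda}^2) - 1)/n$. By applying this to
Eq. (4), we deduce that $E(\ln(\sigma_{t+1}/\sigma_t)) = \frac{1}{2 d_\sigma n}(E(\mathcal{N}_{1:\lambda}^2) - 1)$.
Furthermore, as $E(\mathcal{N}_{1:\lambda}^2) \le E((\lambda \mathcal{N}(0,1))^2) = \lambda^2 < \infty$,
we have $E(\|\xi_t^\star\|^2) < \infty$. The sequence $(\|\xi_t^\star\|^2)_{t \in \mathbb{N}}$ being i.i.d.
according to Lemma 1, and being integrable as we just showed, we can apply the strong law of
large numbers on Eq. (5). We obtain

\[ \frac{1}{t} \ln\left(\frac{\sigma_t}{\sigma_0}\right) \xrightarrow[t \to \infty]{a.s.} \frac{1}{2 d_\sigma n} \left( E(\mathcal{N}_{1:\lambda}^2) - 1 \right) . \]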

The proposition reveals that the sign of $(E(\mathcal{N}_{1:\lambda}^2) - 1)$ determines whether the
step-size diverges to infinity. In the following, we show that $E(\mathcal{N}_{1:\lambda}^2)$
increases in $\lambda$ for $\lambda \ge 2$ and that the $(1,\lambda)$-ES diverges for $\lambda \ge 3$.
For $\lambda = 1$ and $\lambda = 2$, the step-size follows a random walk on the log-scale.

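Lemma 2. Let $(\mathcal{N}_i)_{i \in [[1,\lambda]]}$ be independent random variables, distributed
according to $\mathcal{N}(0,1)$, and $\mathcal{N}_{i:\lambda}$ the $i^{\mathrm{th}}$ order statistic of
$(\mathcal{N}_i)_{i \in [[1,\lambda]]}$. Then $E(\mathcal{N}_{1:1}^2) = E(\mathcal{N}_{1:2}^2) = 1$.
In addition, for all $\lambda \ge 2$, $E(\mathcal{N}_{1:\lambda+1}^2) > E(\mathcal{N}_{1:\lambda}^2)$.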

Proof. (see [ ] for the full proof) The idea of the proof is to use the symmetry of the
normal distribution to show that for two random variables $U \sim \mathcal{N}_{1:\lambda+1}$ and
$V \sim \mathcal{N}_{1:\lambda}$, for every event $E_1$ where $U^2 < V^2$, there exists another event
$E_2$ counterbalancing the effect of $E_1$, i.e.
$\int_{E_2} (u^2 - v^2) f_{U,V}(u,v)\, du\, dv = \int_{E_1} (v^2 - u^2) f_{U,V}(u,v)\, du\, dv$, with
$f_{U,V}$ the joint density of the couple $(U, V)$. We then have
$E(\mathcal{N}_{1:\lambda+1}^2) \ge E(\mathcal{N}_{1:\lambda}^2)$. As there is a non-negligible set of
events $E_3$, distinct from $E_1$ and $E_2$, where $U^2 > V^2$, we have
$E(\mathcal{N}_{1:\lambda+1}^2) > E(\mathcal{N}_{1:\lambda}^2)$.

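For $\lambda = 1$, $\mathcal{N}_{1:1} \sim \mathcal{N}(0,1)$ so $E(\mathcal{N}_{1:1}^2) = 1$. For
$\lambda = 2$ we have $E(\mathcal{N}_{1:2}^2 + \mathcal{N}_{2:2}^2) = 2 E(\mathcal{N}(0,1)^2) = 2$, and
since the normal distribution is symmetric $E(\mathcal{N}_{1:2}^2) = E(\mathcal{N}_{2:2}^2)$,
hence $E(\mathcal{N}_{1:2}^2) = 1$. □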
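The statement of Lemma 2 can be checked numerically. The sketch below is only an illustration (the function name, sample size and seed are arbitrary choices, not from the paper): it estimates $E(\mathcal{N}_{1:\lambda}^2)$ by Monte Carlo, taking the minimum of $\lambda$ independent standard normal draws and averaging its square.

```python
import random


def second_moment_of_min(lam, samples=100_000, seed=0):
    """Monte Carlo estimate of E[N_{1:lam}^2], the squared minimum of lam N(0,1) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        smallest = min(rng.gauss(0.0, 1.0) for _ in range(lam))
        total += smallest * smallest
    return total / samples


if __name__ == "__main__":
    # Estimates should be close to 1 for lam = 1, 2 and increase for lam >= 2.
    for lam in (1, 2, 3, 4, 8):
        print(lam, round(second_moment_of_min(lam), 3))
```

The estimates carry sampling noise, so they support the lemma only up to Monte Carlo error; they are not a substitute for the proof.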



We can now link Proposition 1 and Lemma 2 into the following theorem: 

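Theorem 1. On linear functions, for $\lambda \ge 3$, the step-size of the $(1,\lambda)$-CSA-ES without
cumulation ($c = 1$) diverges geometrically almost surely and in expectation at the rate
$\frac{1}{2 d_\sigma n}(E(\mathcal{N}_{1:\lambda}^2) - 1)$, i.e.

\[ \frac{1}{t} \ln\left(\frac{\sigma_t}{\sigma_0}\right) \xrightarrow[t \to \infty]{a.s.} E\left(\ln\left(\frac{\sigma_{t+1}}{\sigma_t}\right)\right) = \frac{1}{2 d_\sigma n} \left( E(\mathcal{N}_{1:\lambda}^2) - 1 \right) . \tag{6} \]

For $\lambda = 1$ and $\lambda = 2$, without cumulation, the logarithm of the step-size does
an additive unbiased random walk, i.e. $\ln \sigma_{t+1} = \ln \sigma_t + W_t$ where $E[W_t] = 0$. More
precisely $W_t \sim \frac{1}{2 d_\sigma}(\chi_n^2/n - 1)$ for $\lambda = 1$, and
$W_t \sim \frac{1}{2 d_\sigma}((\mathcal{N}_{1:2}^2 + \chi_{n-1}^2)/n - 1)$ for $\lambda = 2$, where
$\chi_k^2$ stands for the chi-squared distribution with $k$ degrees of freedom.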

Proof. For $\lambda > 2$, from Lemma 2 we know that $E(\mathcal{N}_{1:\lambda}^2) > E(\mathcal{N}_{1:2}^2) = 1$.
Therefore $E(\mathcal{N}_{1:\lambda}^2) - 1 > 0$, hence Eq. (6) is strictly positive, and with
Proposition 1 we get that the step-size diverges geometrically almost surely at the rate
$\frac{1}{2 d_\sigma n}(E(\mathcal{N}_{1:\lambda}^2) - 1)$.

With Eq. (4) we have $\ln(\sigma_{t+1}) = \ln(\sigma_t) + W_t$, with
$W_t = \frac{1}{2 d_\sigma}(\|\xi_t^\star\|^2/n - 1)$. For $\lambda = 1$ and $\lambda = 2$, according to
Lemma 2, $E(W_t) = 0$. Hence $\ln(\sigma_t)$ does an additive unbiased random walk. Furthermore
$\|\xi\|^2 = \mathcal{N}_{1:\lambda}^2 + \chi_{n-1}^2$, so for $\lambda = 1$, since
$\mathcal{N}_{1:1} = \mathcal{N}(0,1)$, $\|\xi\|^2 = \chi_n^2$. □

In [8] we extend this result on the step-size to $|[X_t]_1|$, which diverges geometrically
almost surely at the same rate.
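As a worked instance of the rate in Theorem 1, take $\lambda = 3$. The value $E(\mathcal{N}_{1:3}^2) = 1 + \sqrt{3}/(2\pi)$ used below is a standard closed form for the extreme order statistics of three standard normal variables; it is quoted here as an outside fact and is not derived in this paper.

```latex
\frac{1}{2 d_\sigma n}\left( E(\mathcal{N}_{1:3}^2) - 1 \right)
  = \frac{1}{2 d_\sigma n} \cdot \frac{\sqrt{3}}{2\pi}
  = \frac{\sqrt{3}}{4 \pi\, d_\sigma n}
  \approx \frac{0.138}{d_\sigma n} .
```

The rate is positive but shrinks like $1/n$, consistent with the slow divergence expected in large dimension without cumulation.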

4 Divergence rate of $(1,\lambda)$-CSA-ES with cumulation

We are now investigating the $(1,\lambda)$-CSA-ES with cumulation, i.e. $0 < c < 1$. The path
$\mathcal{P}$ is then a Markov chain and contrary to the case where $c = 1$ we cannot apply a
LLN for independent variables to Eq. (5) in order to prove the almost sure geometric
divergence. However LLN for Markov chains exist as well, provided the Markov chain
satisfies some stability properties: in particular, the Markov chain $\mathcal{P}$ must be
$\varphi$-irreducible, that is, there exists a measure $\varphi$ such that every Borel set $A$ of
$\mathbb{R}^n$ with $\varphi(A) > 0$ has a positive probability to be reached in a finite number of
steps by $\mathcal{P}$ starting from any $p_0 \in \mathbb{R}^n$. In addition, the chain $\mathcal{P}$
needs to be (i) positive, that is the chain admits an invariant probability measure $\pi$, i.e.,
for any Borel set $A$, $\pi(A) = \int_{\mathbb{R}^n} P(x, A)\, \pi(dx)$ with $P(x, A)$ being the
probability to transition in one time step from $x$ into $A$, and (ii) Harris recurrent, which
means for any Borel set $A$ such that $\varphi(A) > 0$, the chain $\mathcal{P}$ visits $A$ an
infinite number of times with probability one. Under those conditions, $\mathcal{P}$ satisfies
a LLN, more precisely:

Lemma 3. [ , 17.0.1] Suppose that $\mathcal{P}$ is a positive Harris chain with invariant
probability measure $\pi$, and let $g$ be a $\pi$-integrable function such that
$\pi(|g|) = \int_{\mathbb{R}^n} |g(x)|\, \pi(dx) < \infty$. Then
$\frac{1}{t} \sum_{k=1}^{t} g(p_k) \xrightarrow[t \to \infty]{a.s.} \pi(g)$.



The path $\mathcal{P}$ satisfies the conditions of Lemma 3 and exhibits an invariant measure [8].
By a recurrence on Eq. (2) we see that the path follows the following equation

\[ p_t = (1-c)^t p_0 + \sqrt{c(2-c)} \sum_{k=0}^{t-1} (1-c)^k\, \xi_{t-1-k}^\star . \tag{7} \]

For $i \ne 1$, $[\xi_t^\star]_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)$ and, as also
$[p_0]_i \sim \mathcal{N}(0,1)$, by recurrence $[p_t]_i \sim \mathcal{N}(0,1)$ for all
$t \in \mathbb{N}$. For $i = 1$ with cumulation ($c < 1$), the influence of $[p_0]_1$
vanishes with $(1-c)^t$. Furthermore, as from Lemma 1 the sequence
$([\xi_t^\star]_1)_{t \in \mathbb{N}}$ is independent, we get by applying Kolmogorov's three series
theorem that the series $\sum_{k=0}^{t-1} (1-c)^k [\xi_{t-1-k}^\star]_1$ converges almost surely.
Therefore, the first component of the path becomes distributed as the random variable
$[p_\infty]_1 = \sqrt{c(2-c)} \sum_{k=0}^{\infty} (1-c)^k [\xi_k^\star]_1$
(by re-indexing the variable $\xi_{t-1-k}^\star$ in $\xi_k^\star$, as the sequence
$(\xi_t^\star)_{t \in \mathbb{N}}$ is i.i.d.).
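The unrolled form of the path can be checked against the recursion on a single coordinate. The sketch below is an illustration under the assumption that the recursion reads $p_{k+1} = (1-c) p_k + \sqrt{c(2-c)}\, \xi_k$ and the unrolled form reads $p_t = (1-c)^t p_0 + \sqrt{c(2-c)} \sum_{k=0}^{t-1} (1-c)^k \xi_{t-1-k}$; the function names are hypothetical.

```python
import math
import random


def path_by_recursion(p0, steps, c):
    """Iterate p_{k+1} = (1 - c) * p_k + sqrt(c * (2 - c)) * step_k over all steps."""
    p = p0
    for s in steps:
        p = (1.0 - c) * p + math.sqrt(c * (2.0 - c)) * s
    return p


def path_closed_form(p0, steps, c):
    """Evaluate (1 - c)^t * p_0 + sqrt(c * (2 - c)) * sum_k (1 - c)^k * step_{t-1-k}."""
    t = len(steps)
    weighted = sum((1.0 - c) ** k * steps[t - 1 - k] for k in range(t))
    return (1.0 - c) ** t * p0 + math.sqrt(c * (2.0 - c)) * weighted


if __name__ == "__main__":
    rng = random.Random(42)
    steps = [rng.gauss(0.0, 1.0) for _ in range(30)]
    print(path_by_recursion(0.5, steps, 0.2), path_closed_form(0.5, steps, 0.2))
```

Both functions should agree up to floating-point rounding, and the weight $(1-c)^k$ makes visible how older steps fade exponentially.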

We now obtain geometric divergence of the step-size and get an explicit estimate of 
the expression of the divergence rate. 

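Theorem 2. The step-size of the $(1,\lambda)$-CSA-ES with $\lambda \ge 2$ diverges geometrically fast
if $c < 1$ or $\lambda \ge 3$. Almost surely and in expectation we have for $0 < c \le 1$,

\[ \frac{1}{t} \ln\left(\frac{\sigma_t}{\sigma_0}\right) \xrightarrow[t \to \infty]{} \frac{1}{2 d_\sigma n} \underbrace{\left( 2(1-c)\, E(\mathcal{N}_{1:\lambda})^2 + c \left( E(\mathcal{N}_{1:\lambda}^2) - 1 \right) \right)}_{> 0 \text{ for } \lambda \ge 3 \text{ and for } \lambda = 2 \text{ and } c < 1} . \tag{8} \]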



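Proof. For proving almost sure convergence of $\ln(\sigma_t/\sigma_0)/t$ we need to use the LLN for
Markov chains. We refer to [ ] for the proof that $\mathcal{P}$ satisfies the right assumptions. We
now focus on the convergence in expectation. From Eq. (4) we have
$E(\ln(\sigma_{t+1}/\sigma_t)) = \frac{c}{2 d_\sigma}(E(\|p_{t+1}\|^2)/n - 1)$, so
$E(\|p_{t+1}\|^2) = E(\sum_{i=1}^{n} [p_{t+1}]_i^2)$ is the term we have to analyse. From Eq. (7)
and its conclusions we get that for $j \ne 1$, $[p_t]_j \sim \mathcal{N}(0,1)$, so
$E(\sum_{i=1}^{n} [p_{t+1}]_i^2) = E([p_{t+1}]_1^2) + (n-1)$. When $t$ goes to infinity, the
influence of $[p_0]_1$ in this equation goes to $0$ with $(1-c)^{t+1}$, so we can remove it when
taking the limit:

\[ \lim_{t \to \infty} E\left([p_{t+1}]_1^2\right) = \lim_{t \to \infty} E\left( \left( \sqrt{c(2-c)} \sum_{i=0}^{t} (1-c)^i [\xi_{t-i}^\star]_1 \right)^2 \right) . \tag{9} \]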



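We will now develop the sum with the square, such that we have either a product
$[\xi_{t-i}^\star]_1 [\xi_{t-j}^\star]_1$ with $i \ne j$, or $[\xi_{t-j}^\star]_1^2$. This way, we can
separate the variables by using Lemma 1 with the independence of $\xi_i^\star$ over time. To do so,
we use the development formula
$(\sum_{i=1}^{n} a_i)^2 = 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_i a_j + \sum_{i=1}^{n} a_i^2$. We take the
limit of $E([p_{t+1}]_1^2)$ and find that it is equal to

\[ \lim_{t \to \infty} c(2-c) \left( 2 \sum_{i=0}^{t} \sum_{j=i+1}^{t} (1-c)^{i+j}\, \underbrace{E\left( [\xi_{t-i}^\star]_1 [\xi_{t-j}^\star]_1 \right)}_{= E(\mathcal{N}_{1:\lambda})^2} + \sum_{i=0}^{t} (1-c)^{2i}\, \underbrace{E\left( [\xi_{t-i}^\star]_1^2 \right)}_{= E(\mathcal{N}_{1:\lambda}^2)} \right) . \tag{10} \]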

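Now the expected value does not depend on $i$ or $j$, so what is left is to calculate
$\sum_{i=0}^{t} \sum_{j=i+1}^{t} (1-c)^{i+j}$ and $\sum_{i=0}^{t} (1-c)^{2i}$. We have
$\sum_{i=0}^{t} \sum_{j=i+1}^{t} (1-c)^{i+j} = \sum_{i=0}^{t} (1-c)^{2i+1} \frac{1 - (1-c)^{t-i}}{1 - (1-c)}$
and when we separate this sum in two, the right-hand part goes to $0$ for $t \to \infty$.
Therefore, the left-hand part converges to $\lim_{t \to \infty} \sum_{i=0}^{t} (1-c)^{2i+1}/c$,
which is equal to $\lim_{t \to \infty} \frac{1-c}{c} \sum_{i=0}^{t} (1-c)^{2i}$. And
$\sum_{i=0}^{t} (1-c)^{2i}$ is equal to $(1 - (1-c)^{2t+2})/(1 - (1-c)^2)$, which converges to
$1/(c(2-c))$. So, by inserting this in Eq. (10) we get that
$E([p_{t+1}]_1^2) \xrightarrow[t \to \infty]{} \frac{2-2c}{c} E(\mathcal{N}_{1:\lambda})^2 + E(\mathcal{N}_{1:\lambda}^2)$,
which gives us the right hand side of Eq. (8).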

By summing $E(\ln(\sigma_{i+1}/\sigma_i))$ for $i = 0, \dots, t-1$ and dividing by $t$ we have the
Cesaro mean $\frac{1}{t} E(\ln(\sigma_t/\sigma_0))$ that converges to the same value that
$E(\ln(\sigma_{t+1}/\sigma_t))$ converges to when $t$ goes to infinity. Therefore we have in
expectation Eq. (8).

According to Lemma 2, for $\lambda = 2$, $E(\mathcal{N}_{1:2}^2) = 1$, so the RHS of Eq. (8) is equal
to $\frac{1-c}{d_\sigma n} E(\mathcal{N}_{1:2})^2$. The expected value of $\mathcal{N}_{1:2}$ is
strictly negative, so the previous expression is strictly positive for $c < 1$. Furthermore,
according to Lemma 2, $E(\mathcal{N}_{1:\lambda}^2)$ increases with $\lambda$, as does
$E(\mathcal{N}_{1:\lambda})^2$. Therefore we have geometric divergence for $\lambda \ge 2$ when
$c < 1$. □
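The limit rate of Theorem 2 is easy to evaluate once the first two moments of $\mathcal{N}_{1:\lambda}$ are known. The sketch below assumes the rate reads $\frac{1}{2 d_\sigma n}\big(2(1-c) E(\mathcal{N}_{1:\lambda})^2 + c(E(\mathcal{N}_{1:\lambda}^2) - 1)\big)$; the function and constant names are hypothetical. For $\lambda = 2$ it uses the standard value $E(\mathcal{N}_{1:2}) = -1/\sqrt{\pi}$ (an outside fact) together with $E(\mathcal{N}_{1:2}^2) = 1$ from Lemma 2.

```python
import math


def divergence_rate(c, n, mean_min, second_moment_min, d_sigma=1.0):
    """Limit of ln(sigma_t / sigma_0) / t, given E[N_{1:lam}] and E[N_{1:lam}^2]."""
    bracket = 2.0 * (1.0 - c) * mean_min ** 2 + c * (second_moment_min - 1.0)
    return bracket / (2.0 * d_sigma * n)


# Moments of the minimum of two standard normal variables.
MEAN_MIN_2 = -1.0 / math.sqrt(math.pi)  # standard closed form, not from the paper
SECOND_MOMENT_MIN_2 = 1.0               # Lemma 2

if __name__ == "__main__":
    for c in (1.0, 0.5, 0.1):
        print(c, divergence_rate(c, n=10, mean_min=MEAN_MIN_2,
                                 second_moment_min=SECOND_MOMENT_MIN_2))
```

For $\lambda = 2$ the rate vanishes at $c = 1$ and equals $(1-c)/(\pi d_\sigma n)$ otherwise, which matches the case distinction in the theorem.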

From Eq. (1) we see that the behavior of the step-size and of $(X_t)_{t \in \mathbb{N}}$ are directly
related. Geometric divergence of the step-size, as shown in Theorem 2, means that also
the movements in search space and the improvements on affine linear functions $f$
increase geometrically fast. Therefore, as we showed in Theorem 2 geometric divergence
for the step-size when $\lambda \ge 2$ and $c < 1$, or when $\lambda \ge 3$, we expect geometric
divergence on the first dimension of $(X_t)_{t \in \mathbb{N}}$ (the first dimension being the only
dimension with selection pressure). Analyzing $(X_t)_{t \in \mathbb{N}}$ with cumulation requires
studying a double Markov chain, which is left to possible future research.

5 Study of the variations of $\ln(\sigma_{t+1}/\sigma_t)$

The proof of Theorem 2 shows that the step size increase converges to the right hand
side of Eq. (8), for $t \to \infty$. When the dimension increases this increment goes to
zero, which also suggests that it becomes more likely that $\sigma_{t+1}$ is smaller than
$\sigma_t$. To analyze this behavior, we study the variance of $\ln(\sigma_{t+1}/\sigma_t)$ as a
function of $c$ and the dimension.

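Theorem 3. The variance of $\ln(\sigma_{t+1}/\sigma_t)$ equals

\[ \operatorname{Var}\left( \ln\left(\frac{\sigma_{t+1}}{\sigma_t}\right) \right) = \frac{c^2}{4 d_\sigma^2 n^2} \left( E\left([p_{t+1}]_1^4\right) - E\left([p_{t+1}]_1^2\right)^2 + 2(n-1) \right) . \tag{11} \]

Furthermore, $E([p_{t+1}]_1^2) \xrightarrow[t \to \infty]{} E(\mathcal{N}_{1:\lambda}^2) + \frac{2-2c}{c} E(\mathcal{N}_{1:\lambda})^2$
and with $a = 1 - c$

\[ \lim_{t \to \infty} E\left([p_{t+1}]_1^4\right) = \frac{(1-a^2)^2}{1-a^4} \left( k_4 + k_{31} + k_{22} + k_{211} + k_{1111} \right) , \tag{12} \]

where $k_4 = E(\mathcal{N}_{1:\lambda}^4)$,
$k_{31} = 4 \frac{a(1+a+2a^2)}{1-a^3} E(\mathcal{N}_{1:\lambda}^3) E(\mathcal{N}_{1:\lambda})$,
$k_{22} = 6 \frac{a^2}{1-a^2} E(\mathcal{N}_{1:\lambda}^2)^2$,
$k_{211} = 12 \frac{a^3(1+2a+3a^2)}{(1-a^2)(1-a^3)} E(\mathcal{N}_{1:\lambda}^2) E(\mathcal{N}_{1:\lambda})^2$ and
$k_{1111} = 24 \frac{a^6}{(1-a)(1-a^2)(1-a^3)} E(\mathcal{N}_{1:\lambda})^4$.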



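Proof.

\[ \operatorname{Var}\left( \ln\left(\frac{\sigma_{t+1}}{\sigma_t}\right) \right) = \operatorname{Var}\left( \frac{c}{2 d_\sigma} \left( \frac{\|p_{t+1}\|^2}{n} - 1 \right) \right) = \frac{c^2}{4 d_\sigma^2 n^2} \underbrace{\operatorname{Var}\left( \|p_{t+1}\|^2 \right)}_{E(\|p_{t+1}\|^4) - E(\|p_{t+1}\|^2)^2} \tag{13} \]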

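The first part of $\operatorname{Var}(\|p_{t+1}\|^2)$, $E(\|p_{t+1}\|^4)$, is equal to
$E((\sum_{i=1}^{n} [p_{t+1}]_i^2)^2)$. We develop it along the dimensions such that we can use the
independence of $[p_{t+1}]_i$ with $[p_{t+1}]_j$ for $i \ne j$, to get
$E(2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} [p_{t+1}]_i^2 [p_{t+1}]_j^2 + \sum_{i=1}^{n} [p_{t+1}]_i^4)$. For
$i \ne 1$, $[p_{t+1}]_i$ is distributed according to a standard normal distribution, so
$E([p_{t+1}]_i^2) = 1$ and $E([p_{t+1}]_i^4) = 3$.

\[ \begin{aligned}
E\left(\|p_{t+1}\|^4\right)
  &= 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} E\left([p_{t+1}]_i^2\right) E\left([p_{t+1}]_j^2\right) + \sum_{i=1}^{n} E\left([p_{t+1}]_i^4\right) \\
  &= \left( 2 \sum_{i=2}^{n} \sum_{j=i+1}^{n} 1 \right) + 2 \sum_{j=2}^{n} E\left([p_{t+1}]_1^2\right) + \left( \sum_{i=2}^{n} 3 \right) + E\left([p_{t+1}]_1^4\right) \\
  &= \left( 2 \sum_{i=2}^{n} (n-i) \right) + 2(n-1) E\left([p_{t+1}]_1^2\right) + 3(n-1) + E\left([p_{t+1}]_1^4\right) \\
  &= E\left([p_{t+1}]_1^4\right) + 2(n-1) E\left([p_{t+1}]_1^2\right) + (n-1)(n+1)
\end{aligned} \]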

The other part left is $E(\|p_{t+1}\|^2)^2$, which we develop along the dimensions to get
$E(\sum_{i=1}^{n} [p_{t+1}]_i^2)^2 = (E([p_{t+1}]_1^2) + (n-1))^2$, which equals
$E([p_{t+1}]_1^2)^2 + 2(n-1) E([p_{t+1}]_1^2) + (n-1)^2$. So by subtracting both parts we get
$E(\|p_{t+1}\|^4) - E(\|p_{t+1}\|^2)^2 = E([p_{t+1}]_1^4) - E([p_{t+1}]_1^2)^2 + 2(n-1)$, which we
insert into Eq. (13) to get Eq. (11).

The development of $E([p_{t+1}]_1^2)$ is the same as the one done in the proof of
Theorem 2. We refer to [8] for the development of $E([p_{t+1}]_1^4)$, since space
limitations prevent us from presenting it here. □
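The formulas of Theorem 3 can be turned into a small numerical helper. The sketch below assumes that the limit of the fourth moment reads $\frac{(1-a^2)^2}{1-a^4}(k_4 + k_{31} + k_{22} + k_{211} + k_{1111})$ with $a = 1 - c$ and the coefficients $4\frac{a(1+a+2a^2)}{1-a^3}$, $6\frac{a^2}{1-a^2}$, $12\frac{a^3(1+2a+3a^2)}{(1-a^2)(1-a^3)}$ and $24\frac{a^6}{(1-a)(1-a^2)(1-a^3)}$; these were rederived from the stationary path and are an assumption wherever the printed formula is hard to read. Function names are hypothetical, and `m1` to `m4` denote the first four moments of $\mathcal{N}_{1:\lambda}$.

```python
def limit_second_moment(c, m1, m2):
    """Limit of E([p_{t+1}]_1^2): m2 + (2 - 2c)/c * m1^2, for 0 < c <= 1."""
    return m2 + (2.0 - 2.0 * c) / c * m1 ** 2


def limit_fourth_moment(c, m1, m2, m3, m4):
    """Limit of E([p_{t+1}]_1^4) under the assumed coefficients, for 0 < c <= 1."""
    a = 1.0 - c
    k4 = m4
    k31 = 4.0 * a * (1.0 + a + 2.0 * a ** 2) / (1.0 - a ** 3) * m3 * m1
    k22 = 6.0 * a ** 2 / (1.0 - a ** 2) * m2 ** 2
    k211 = (12.0 * a ** 3 * (1.0 + 2.0 * a + 3.0 * a ** 2)
            / ((1.0 - a ** 2) * (1.0 - a ** 3)) * m2 * m1 ** 2)
    k1111 = 24.0 * a ** 6 / ((1.0 - a) * (1.0 - a ** 2) * (1.0 - a ** 3)) * m1 ** 4
    return (1.0 - a ** 2) ** 2 / (1.0 - a ** 4) * (k4 + k31 + k22 + k211 + k1111)


def variance_log_step_change(c, n, second, fourth, d_sigma=1.0):
    """Variance of ln(sigma_{t+1}/sigma_t) from the moments of [p_{t+1}]_1."""
    return c ** 2 / (4.0 * d_sigma ** 2 * n ** 2) * (fourth - second ** 2 + 2.0 * (n - 1))


if __name__ == "__main__":
    # Sanity check with random selection (lambda = 1): moments of N(0,1) are 0, 1, 0, 3.
    print(limit_fourth_moment(0.3, 0.0, 1.0, 0.0, 3.0))
```

As a consistency check, with the moments of a standard normal variable the path coordinate stays standard normal, so the fourth moment should evaluate to 3 and the variance to $c^2/(2 d_\sigma^2 n)$.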

Figure 1 shows the time evolution of $\ln(\sigma_t/\sigma_0)$ for 5001 runs and $c = 1$ (left) and
$c = 1/\sqrt{n}$ (right). By comparing Figure 1a and Figure 1b we observe smaller variations
of $\ln(\sigma_t/\sigma_0)$ with the smaller value of $c$.

Figure 2 shows the relative standard deviation of $\ln(\sigma_{t+1}/\sigma_t)$ (i.e. the standard
deviation divided by its expected value). Lowering $c$, as shown in the left, decreases
the relative standard deviation. To get a value below one, $c$ must be smaller for larger
dimension. In agreement with Theorem 3, in Figure 2, right, the relative standard
deviation increases like $\sqrt{n}$ with the dimension for constant $c$ (three increasing curves).
A careful study [8] of the variance equation of Theorem 3 shows that for the choice
of $c = 1/(1 + n^\alpha)$, if $\alpha > 1/3$ the relative standard deviation converges to $0$ with
Fig. 1: $\ln(\sigma_t/\sigma_0)$ against $t$, (a) without cumulation ($c = 1$) and (b) with
cumulation ($c = 1/\sqrt{20}$). The different curves represent the quantiles of a set of
$5 \cdot 10^3 + 1$ samples, more precisely the $10^{-i}$-quantile and the $1 - 10^{-i}$-quantile
for $i$ from 1 to 4; and the median. We have $n = 20$ and $\lambda = 8$.

$\sqrt{(n^{2\alpha} + n)/n^{3\alpha}}$. Taking $\alpha = 1/3$ is a critical value where the relative
standard deviation converges to $1/(\sqrt{2}\, E(\mathcal{N}_{1:\lambda})^2)$. On the other hand,
lower values of $\alpha$ make the relative standard deviation diverge with $n^{(1-3\alpha)/2}$.
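The three regimes can be seen by tabulating the scaling factor for a few dimensions. This is a rough sketch that assumes the factor is $\sqrt{(n^{2\alpha} + n)/n^{3\alpha}}$ and ignores the constant in front of it; the function name is hypothetical.

```python
import math


def relative_std_scale(n, alpha):
    """Scaling factor sqrt((n^(2*alpha) + n) / n^(3*alpha)) for c = 1/(1 + n^alpha)."""
    return math.sqrt((n ** (2.0 * alpha) + n) / n ** (3.0 * alpha))


if __name__ == "__main__":
    # alpha < 1/3: grows with n; alpha = 1/3: levels off; alpha > 1/3: shrinks.
    for alpha in (0.0, 1.0 / 3.0, 0.5, 1.0):
        print(alpha, [round(relative_std_scale(n, alpha), 3) for n in (10, 1000, 100000)])
```

Only the trend in $n$ is meaningful here; the actual relative standard deviation also depends on the moments of $\mathcal{N}_{1:\lambda}$.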

6 Summary 

We investigate throughout this paper the $(1,\lambda)$-CSA-ES on affine linear functions
composed with strictly increasing transformations. We find, in Theorem 2, the limit
distribution for $\ln(\sigma_t/\sigma_0)/t$ and rigorously prove the desired behaviour of $\sigma$
with $\lambda \ge 3$ for any $c$, and with $\lambda = 2$ and cumulation ($0 < c < 1$): the step-size
diverges geometrically fast. In contrast, without cumulation ($c = 1$) and with $\lambda = 2$, a
random walk on $\ln(\sigma)$ occurs, like for the $(1,2)$-$\sigma$SA-ES [9] (and also for the same
symmetry reason). We derive an expression for the variance of the step-size increment. On
linear functions when $c = 1/n^\alpha$, for $\alpha \ge 0$ ($\alpha = 0$ meaning $c$ constant) and for
$n \to \infty$ the standard deviation is about $\sqrt{(n^{2\alpha} + n)/n^{3\alpha}}$ times larger than
the step-size increment. From this follows that keeping $c < 1/n^{1/3}$ ensures that the
standard deviation of $\ln(\sigma_{t+1}/\sigma_t)$ becomes negligible compared to
$\ln(\sigma_{t+1}/\sigma_t)$ when the dimension goes to infinity. That means, the noise-to-signal
ratio goes to zero, giving the algorithm strong stability. The result confirms that even the
largest default cumulation parameter $c = 1/\sqrt{n}$ is a stable choice.
Fig. 2: Standard deviation of $\ln(\sigma_{t+1}/\sigma_t)$ relatively to its expectation. Here
$\lambda = 8$. The curves were plotted using Eq. (11) and Eq. (12). On the left, curves for
(right to left) $n = 2$, $20$, $200$ and $2000$. On the right, plotted against the dimension of
the search space, different curves for (top to bottom) $c = 1$, $0.5$, $0.2$,
$1/(1 + n^{1/4})$, $1/(1 + n^{1/3})$, $1/(1 + n^{1/2})$ and $1/(1 + n)$.